Analyzing Healthcare Provider Shortage - Part 2/4

In this part of the study, we will explore geocoded provider data with respect to socioeconomic and demographic factors such as population density, median income, median age, population with health insurance etc. We will study the distribution of:

  • Overall providers in the U.S.
  • Explore Texas to find out how healthcare providers are distributed at a county level

In this section, we will:

  • Get Geocoded data and check geocoding results
  • Gather and Process Demographic and Health Expenditure Data
  • Aggregate Provider Data at the County Level
  • Explore distribution of Heakthcare Providers in U.S.
  • Explore Texas and study how:
    • Provider Count varies with Population Density, Median Income and Median Age

Part 2: Provider Shortage - Data Exploration

In [1]:
# Import Libraries
from IPython.display import display

# Import arcgis
import arcgis
from arcgis.gis import GIS
from arcgis.features import FeatureLayer
from arcgis.mapping import WebMap

# Import libraries for data exploration
import pandas as pd
pd.set_option('display.max_columns', 500)
import numpy as np

# Import plotting libraries
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# Import library for time
import time
In [2]:
# Create a GIS connection
gis = GIS("https://datascienceqa.esri.com/portal", "portaladmin", "esri.agp", verify_cert=False)
# gis = GIS("https://datascienceqa.esri.com/portal", "portaladmin", "esri.agp")

Get Geocoded Provider Data

Provider data was geocoded using the GeoAnalytics server. Let's get the geocoded provider data feature layer for exploration.

In [3]:
# Search the feature layer
search_result = gis.content.search('title: provider_data_geocoded_7_30', 'Feature Layer')
search_result
Out[3]:
[<Item title:"provider_data_geocoded_7_30" type:Feature Layer Collection owner:portaladmin>]
In [4]:
# Get the feature layer item
provider_data_item = search_result[0]
provider_data_item
Out[4]:
provider_data_geocoded_7_30
geocoded healthcare provider dataFeature Layer Collection by portaladmin
Last Modified: September 13, 2019
0 comments, 2 views
In [5]:
# Check for layers inside the item
provider_data_item.layers
Out[5]:
[<FeatureLayer url:"https://datascienceqa.esri.com/server/rest/services/Hosted/provider_data_geocoded_7_30/FeatureServer/0">]
In [6]:
# Get the layer needed for analysis
provider_data_layer = provider_data_item.layers[0]
provider_data_layer
Out[6]:
<FeatureLayer url:"https://datascienceqa.esri.com/server/rest/services/Hosted/provider_data_geocoded_7_30/FeatureServer/0">
In [7]:
# Look at the first 5 fields and their data types
for f in provider_data_layer.properties.fields[:5]:
    print(f['name'],'      ',f['type'])
objectid        esriFieldTypeOID
user_npi        esriFieldTypeDouble
user_entity_type        esriFieldTypeString
user_address        esriFieldTypeString
user_address2        esriFieldTypeString
In [13]:
# Check number of records in `provider_data_layer`
provider_data_layer.query(return_count_only=True)
Out[13]:
5790417

Check Geocoding

Let's do a random test to check accuracy of geocoded data. For a given state, we will look at and plot the data points based on address provided vs address geocoded.

Geocoding Check: Wyoming

In [14]:
# Create a spatially enabled dataframe for WY
%time wy_df = provider_data_layer.query(where="user_Region='WY'", as_df=True)
# %time wy_df = provider_data_layer.query(where="USER_Provider_Business_Practice_Location_Address_State_Name='WY'", as_df=True)

wy_df.shape
Wall time: 16.3 s
Out[14]:
(12770, 32)
In [15]:
# Look at data points that are geocoded outside of Wyoming
len(wy_df[wy_df['region']!='Wyoming'])
Out[15]:
3
In [44]:
# Check the accuracy of geocoding
round(100-((wy_df.shape[1]/wy_df.shape[0])*100),2)
Out[44]:
99.29
In [9]:
# Create a map
map1 = gis.map('USA')
map1
In [10]:
# Plot data points for WY
wy_df.spatial.plot(map1)
Out[10]:
True
In [11]:
map1.take_screenshot()
In [36]:
map1.remove_layers()
Out[36]:
True

We have an accuracy of >99% with geocoding. From this map, we can see that out of 12770 points that have been geocoded only 3 are outside of WY. Similar analysis can be performed for other states to check for geocoding errors.

Geocoding Check: Arizona

In [12]:
# Create a spatially enabled dataframe for AZ
%time az_df = provider_data_layer.query(where="user_Region='AZ'", as_df=True)
# %time az_df = provider_data_layer.query(where="USER_Provider_Business_Practice_Location_Address_State_Name='AZ'", as_df=True)

az_df.shape
Wall time: 16min 22s
Out[12]:
(108769, 91)
In [15]:
# Look at data points that are geocoded outside of Arizona
len(az_df[az_df['Region']!='Arizona'])
Out[15]:
7582

Geocoding Check: Texas

In [129]:
# Create a spatially enabled dataframe for TX
%time tx_df = provider_data_layer.query(where="user_Region='TX'", as_df=True)
tx_df.shape
Wall time: 4min
Out[129]:
(365783, 93)
In [130]:
# Look at data points that are geocoded outside of Texas
len(tx_df[tx_df['Region']!='Texas'])
Out[130]:
17

Heat Map - All Providers

Heatmap of Providers in U.S.

To visualize how healthcare providers are distributed, let's do a heatmap of providers in the United States.

In [32]:
map_usa = gis.map('USA')
map_usa
In [70]:
# Add provider data and create a heatmap
# usa_layer = FeatureLayer("https://datascienceqa.esri.com/server/rest/services/Hosted/provider_clean_data_geocoded_6_19/FeatureServer")
renderer = {"renderer": "autocast", #This tells python to use JS autocasting
            "type": "heatmap",
            "blurRadius":1,  # changes the size of the clusters
            "maxPixelIntensity":2,
            "minPixelIntensity":0,
            "field":None}
renderer["colorStops"] = [{"ratio":0,"color":[63, 40, 102, 0]},
                          {"ratio":0.25,"color":[167,97,170,179]},
                          {"ratio":0.50,"color":"#7b3ce9"},
                          {"ratio":0.75,"color":[222,102,0,179]},
                          {"ratio":1,"color":[244,204,0,179]}]
#                           {"ratio":1,"color":"#ffff00"}]
map_usa.add_layer(provider_data_layer,
               { "type": "FeatureLayer",
                 "renderer": renderer,
                })
In [117]:
# Add Legend
map_usa.legend = True
In [120]:
# Remove layers
map_usa.remove_layers()
Out[120]:
True
In [119]:
map_usa.take_screenshot(set_as_preview=True)

The map shows distribution of providers throughout the US. We can see large gaps in Nevada, Idaho, Utah, Wyoming and Texas. These and other states can be explored further for analysis.

Gather and Process Data

Create County Dataframe from Demographics Data

Here, we will search for demographic data at the county level. To do this, we will:

  • Search for Population density feature layer collection from Esri Living Atlas
  • Select the layer for county level data
  • Create a dataframe of county level demographic data by specifying the columns we need for our analysis
In [9]:
# Search for Population data layer
popsearch_result = gis.content.search('title: 2018 USA Population Density')
popsearch_result
Out[9]:
[<Item title:"2018 USA Population Density" type:Map Image Layer owner:esri_livingatlas>]
In [10]:
# Get Population Density
popdensity = popsearch_result[0]
popdensity
Out[10]:
2018 USA Population Density
This layer shows the population density in the United States in 2018 in persons per square mile in a multiscale map by country, state, county, ZIP Code, tract, and block group. ArcGIS Online subscription required.Map Image Layer by esri_livingatlas
Last Modified: September 05, 2019
0 comments, 0 views
In [6]:
# Check first 5 layers in population Density
popdensity.layers[:5]
Out[6]:
[<FeatureLayer url:"https://datascienceqa.esri.com/portal/sharing/servers/ab4e1996d588405d9cd68348ef660f70/rest/services/USA_Demographics_and_Boundaries_2018/MapServer/0">,
 <FeatureLayer url:"https://datascienceqa.esri.com/portal/sharing/servers/ab4e1996d588405d9cd68348ef660f70/rest/services/USA_Demographics_and_Boundaries_2018/MapServer/1">,
 <FeatureLayer url:"https://datascienceqa.esri.com/portal/sharing/servers/ab4e1996d588405d9cd68348ef660f70/rest/services/USA_Demographics_and_Boundaries_2018/MapServer/2">,
 <FeatureLayer url:"https://datascienceqa.esri.com/portal/sharing/servers/ab4e1996d588405d9cd68348ef660f70/rest/services/USA_Demographics_and_Boundaries_2018/MapServer/3">,
 <FeatureLayer url:"https://datascienceqa.esri.com/portal/sharing/servers/ab4e1996d588405d9cd68348ef660f70/rest/services/USA_Demographics_and_Boundaries_2018/MapServer/4">]
In [8]:
# Look at first few field names for county layer
county_layer = popdensity.layers[46]
print('FIELD NAME', '\t\t', 'FIELD ALIAS')
for field in county_layer.properties.fields[:10]:
    print(field['name'], '\t\t', field['alias'])
FIELD NAME 		 FIELD ALIAS
OBJECTID 		 OBJECTID
Shape 		 Shape
ID 		 ID
NAME 		 NAME
STATE_NAME 		 STATE_NAME
ST_ABBREV 		 ST_ABBREV
AREA 		 Area in Square Miles (Calculated)
TOTPOP_CY 		 2018 Total Population (Esri)
HHPOP_CY 		 2018 Household Population (Esri)
FAMPOP_CY 		 2018 Family Population (Esri)
In [9]:
# Get specific attributes for Counties
%time
county_layer = popdensity.layers[46]
county_df = pd.DataFrame()
offset = 0
while offset <= 3000:
    chunk_df = county_layer.query(out_fields=['Shape','ST_ABBREV','NAME','ASIAN_CY','AMERIND_CY','AVGHHSZ_CY','AVGHINC_CY','BLACK_CY','EDUCBASECY','HISPPOP_CY',
                          'MEDAGE_CY','MINORITYCY','OTHRACE_CY','PCI_CY','POPDENS_CY','UNEMPRT_CY','WHITE_CY','SMCOLL_CY',
                          'ASSCDEG_CY','BACHDEG_CY','GRADDEG_CY','TOTPOP_CY'],return_all_records=False,result_offset=offset,result_record_count=750,as_df=True)
    county_df = pd.concat([chunk_df, county_df], ignore_index=True)
    
    offset += 750
Wall time: 0 ns
In [10]:
county_df.shape
Out[10]:
(3142, 23)

Create County Dataframe from Expenditure Data

Here, we will search for health expenditure data at the county level. To do this, we will:

  • Search for layers related to healthcare spendings from Esri Living Atlas
  • Select the Health Insurance Spending layer for county level data
  • Create a dataframe of county level health expenses data by specifying the columns we need for our analysis
In [11]:
# Search for Population data layer
expsearch_result = gis.content.search('title: 2018 USA Health Insurance Spending')
expsearch_result
Out[11]:
[<Item title:"2018 USA Health Insurance Spending" type:Map Image Layer owner:esri_livingatlas>,
 <Item title:"2018 USA High Credit Card Expenditures" type:Map Image Layer owner:esri_livingatlas>]
In [12]:
# Get Healthcare expenditure data
health_exp = expsearch_result[0]
health_exp
Out[12]:
2018 USA Health Insurance Spending
This layer shows the average amount spent on health insurance per household in the U.S. in 2018, by country, state, county, ZIP Code, tract, and block group. ArcGIS Online subscription required.Map Image Layer by esri_livingatlas
Last Modified: July 19, 2019
0 comments, 2 views
In [13]:
# Check first few layers in population Density
health_exp.layers[:5]
Out[13]:
[<FeatureLayer url:"https://datascienceqa.esri.com/portal/sharing/servers/3fe029d089124e5dab518fd2c4f7eabe/rest/services/USA_Consumer_Expenditures_2018/MapServer/0">,
 <FeatureLayer url:"https://datascienceqa.esri.com/portal/sharing/servers/3fe029d089124e5dab518fd2c4f7eabe/rest/services/USA_Consumer_Expenditures_2018/MapServer/1">,
 <FeatureLayer url:"https://datascienceqa.esri.com/portal/sharing/servers/3fe029d089124e5dab518fd2c4f7eabe/rest/services/USA_Consumer_Expenditures_2018/MapServer/2">,
 <FeatureLayer url:"https://datascienceqa.esri.com/portal/sharing/servers/3fe029d089124e5dab518fd2c4f7eabe/rest/services/USA_Consumer_Expenditures_2018/MapServer/3">,
 <FeatureLayer url:"https://datascienceqa.esri.com/portal/sharing/servers/3fe029d089124e5dab518fd2c4f7eabe/rest/services/USA_Consumer_Expenditures_2018/MapServer/4">]
In [14]:
# Look at the fields and their data types
health_exp_county_layer = health_exp.layers[46]
print('FIELD NAME', '\t', 'FIELD ALIAS')
for f in health_exp_county_layer.properties.fields[:5]:
    print(f['name'], '\t',f['alias'])
FIELD NAME 	 FIELD ALIAS
OBJECTID 	 OBJECTID
Shape 	 Shape
ID 	 ID
NAME 	 NAME
STATE_NAME 	 STATE_NAME
In [15]:
# # Get specific attributes for Counties
health_exp_county_layer = health_exp.layers[46]

health_exp_county_df = pd.DataFrame()
offset = 0
while offset <= 3000:
    chunk_df = health_exp_county_layer.query(out_fields=['ST_ABBREV','NAME','X8001_A','X8002_A','X8013_A','X8018_A','X8019_A','X8024_A','X8032_A','X13002_A','X13004_A'], return_all_records=False,result_offset=offset,result_record_count=500,as_df=True)
    health_exp_county_df = pd.concat([chunk_df, health_exp_county_df], ignore_index=True)
    offset += 500
In [16]:
health_exp_county_df.shape
Out[16]:
(3142, 13)

Spatially Join Demographic and Health Expenditure Data

We will spatially join both the demographic and expenditure layer so as to combine data into a single layer.

In [17]:
# Merge demographic and health expenditure data at county level

county_healthexp_df = county_df.spatial.join(health_exp_county_df,how='left', op='within')
In [18]:
# Check geometry type
county_healthexp_df.spatial.geometry_type
Out[18]:
['polygon']
In [19]:
# Check shape of data
county_healthexp_df.shape
Out[19]:
(3142, 36)
In [20]:
# Write the data into a file geodatabase
county_healthexp_df.spatial.to_featureclass(r'C:\Users\mohi9282\Desktop\arcgis\demographic_healthexp.gdb\demographic_healthexp_layer')
Out[20]:
'C:\\Users\\mohi9282\\Desktop\\arcgis\\demographic_healthexp.gdb\\demographic_healthexp_layer'

This feature class can now be published as a feature service to the portal using ArcGIS Pro. The feature service can be used for any further analysis. We will publish this layer as demographic_healthexp_layer.

Check Merge Results

In [863]:
# Random check data from combined layer
county_healthexp_df[(county_healthexp_df['ST_ABBREV_left']=='NM') & (county_healthexp_df['NAME_left']=='Chaves County')]
Out[863]:
OBJECTID_left ST_ABBREV_left NAME_left ASIAN_CY AMERIND_CY AVGHHSZ_CY AVGHINC_CY BLACK_CY EDUCBASECY HISPPOP_CY MEDAGE_CY MINORITYCY OTHRACE_CY PCI_CY POPDENS_CY UNEMPRT_CY WHITE_CY SMCOLL_CY ASSCDEG_CY BACHDEG_CY GRADDEG_CY SHAPE index_right OBJECTID_right ST_ABBREV_right NAME_right X8001_A X8002_A X8013_A X8018_A X8019_A X8024_A X8032_A X13002_A X13004_A
1189 1798 NM Chaves County 542 972 2.74 54751 1222 42002 38232 35.5 41000 16119 19980 10.9 5.2 44914 9912 3752 5830 2698 {'rings': [[[-11571218.0164, 4039879.771099999... 1439 1798 NM Chaves County 4034.11 2645.58 480.73 1388.53 737.0 43.09 102.35 255.81 5075.13
In [115]:
# Confirm results with County demographic data
county_df[(county_df['ST_ABBREV']=='NM') & (county_df['NAME']=='Chaves County')]
Out[115]:
OBJECTID ST_ABBREV NAME ASIAN_CY AMERIND_CY AVGHHSZ_CY AVGHINC_CY BLACK_CY EDUCBASECY HISPPOP_CY MEDAGE_CY MINORITYCY OTHRACE_CY PCI_CY POPDENS_CY UNEMPRT_CY WHITE_CY SMCOLL_CY ASSCDEG_CY BACHDEG_CY GRADDEG_CY SHAPE
1189 1798 NM Chaves County 542 972 2.74 54751 1222 42002 38232 35.5 41000 16119 19980 10.9 5.2 44914 9912 3752 5830 2698 {'rings': [[[-11571218.0164, 4039879.771099999...
In [116]:
# Confirm results with health expenditure data
health_exp_county_df[(health_exp_county_df['ST_ABBREV']=='NM') & (health_exp_county_df['NAME']=='Chaves County')]
Out[116]:
OBJECTID ST_ABBREV NAME X8001_A X8002_A X8013_A X8018_A X8019_A X8024_A X8032_A X13002_A X13004_A SHAPE
1439 1798 NM Chaves County 4034.11 2645.58 480.73 1388.53 737.0 43.09 102.35 255.81 5075.13 {'rings': [[[-11571218.0164, 4039879.771099999...

Get Provider Count - Aggregate Data

Now, we would like to get the provider count for each county. Provider count is a key variable and we can get the count in multiple ways:

- Method 1: We can query the provider data layer, store results in a spatially enabled dataframe and generate counts.
- Method 2: Use `AggregatePoints` tool from the GeoAnalytics Desktop Toolbox.

Method 1 fails due to the size and memory usage required for our data. This is where the power of GeoAnalytics Toolkit shines.

AggregatePoints tool helps us aggregate point data (providers) onto polygon data (counties). The output generated is a polygon layer with some aggregated statistics.

We will run AggregatePoints with provider data as our point layer and demographic_healthexp_layer as the polygon layer. In simple words, this tool will grab all the providers in a particular county and give us a count of providers for that county along with other demographic and health expenditure data that was in the polygon layer for counties.

Learn more about GeoAnalytics Desktop.

In [241]:
import arcpy
In [243]:
# Run Aggregate Points tool
arcpy.gapro.AggregatePoints(r"C:\Users\mohi9282\Documents\ArcGIS\Projects\GWR test\npi_provider.gdb\allstates_provider",
                            r"C:\Users\mohi9282\Desktop\arcgis\aggregate_points.gdb\agg_points_NB", "POLYGON",
                            r"C:\Users\mohi9282\Desktop\arcgis\demographic_healthexp.gdb\demographic_healthexp_layer",
                            '', None, None, None, None, None)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
c:\users\mohi9282\appdata\local\programs\arcgis\pro\Resources\ArcToolbox\toolboxes\GeoAnalytics Desktop Tools.tbx#AggregatePoints_gapro.InitializeParameters.py in <module>

AttributeError: 'ToolValidator' object has no attribute 'isLicensed'
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
c:\users\mohi9282\appdata\local\programs\arcgis\pro\Resources\ArcToolbox\toolboxes\GeoAnalytics Desktop Tools.tbx#AggregatePoints_gapro.InitializeParameters.py in <module>

AttributeError: 'ToolValidator' object has no attribute 'isLicensed'
Out[243]:
<Result 'C:\\Users\\mohi9282\\Desktop\\arcgis\\aggregate_points.gdb\\agg_points_NB'>

This aggregated data layer can now be published as a feature service to the portal and can be used for further analysis. We will publish this layer as demo_healthexp_allproviders.

Get Aggregated Data Layer

Get the aggrageted data layer that includes provider count and other demographics data

In [3]:
# Search for the layer
allsearch_result = gis.content.search('title: demo_healthexp_allproviders')
allsearch_result
Out[3]:
[<Item title:"demo_healthexp_allproviders" type:Feature Layer Collection owner:portaladmin>,
 <Item title:"demo_healthexp_allproviders" type:Service Definition owner:portaladmin>]
In [4]:
# Get the layer
allprovider = allsearch_result[0]
allprovider
Out[4]:
demo_healthexp_allproviders
Provider count, demographic and health expenditure data with UNCLEANED column namesFeature Layer Collection by portaladmin
Last Modified: September 13, 2019
0 comments, 1 views
In [5]:
allprovider.layers
Out[5]:
[<FeatureLayer url:"https://datascienceqa.esri.com/server/rest/services/Hosted/demo_healthexp_allproviders/FeatureServer/0">]
In [7]:
# Look at the top 5 fields in the layer
allprovider_layer = allprovider.layers[0]
for f in allprovider_layer.properties.fields[:5]:
    print(f['name'], '\t',f['alias'])
objectid1 	 OBJECTID1
objectid 	 OBJECTID
objectid_left 	 OBJECTID_left
st_abbrev_left 	 ST_ABBREV_left
name_left 	 NAME_left
In [8]:
# Store data from layer as a spatially enabled dataframe
allprovider_df = allprovider_layer.query(as_df=True)
allprovider_df.shape
Out[8]:
(3139, 41)
In [9]:
# Look at the first few rows of dataframe
allprovider_df.head()
Out[9]:
objectid1 objectid objectid_left st_abbrev_left name_left asian_cy amerind_cy avghhsz_cy avghinc_cy black_cy educbasecy hisppop_cy medage_cy minoritycy othrace_cy pci_cy popdens_cy unemprt_cy white_cy smcoll_cy asscdeg_cy bachdeg_cy graddeg_cy totpop_cy index_right objectid_right st_abbrev_right name_right x8001_a x8002_a x8013_a x8018_a x8019_a x8024_a x8032_a x13002_a x13004_a count SHAPE__Length SHAPE__Area SHAPE
0 1 2089 1197 MD Calvert County 1701 423 2.85 118378 12062 65337 3945 41.0 20664 886 41430 440.2 5.6 75625 16211 5020 10961 8931 93836 1838 1197 MD Calvert County 7896.60 5227.44 767.21 2669.16 1497.17 83.06 157.57 592.98 11766.62 1044.0 315375.798068 9.139007e+08 {'rings': [[[-8529642.2486, 4688391.039899997]...
1 2 2321 1429 MS Issaquena County 6 3 2.40 38216 895 977 8 37.9 913 2 16066 3.4 5.2 494 199 29 50 45 1403 2070 1429 MS Issaquena County 3310.96 2149.61 475.83 1161.34 579.97 36.65 102.27 191.88 3298.61 5.0 387524.213325 1.526209e+09 {'rings': [[[-10120538.0346, 3896326.054200001...
2 3 1509 2117 OH Seneca County 421 168 2.47 63225 1422 38284 2977 40.0 5873 882 24981 101.8 5.5 51908 7627 4309 3811 2153 56104 758 2117 OH Seneca County 4698.76 3081.77 547.19 1616.99 861.89 51.40 114.41 313.28 5933.05 621.0 207757.548436 2.522762e+09 {'rings': [[[-9221715.5229, 5050074.560699999]...
3 4 2785 393 GA Banks County 258 69 2.77 60360 558 13282 1307 40.4 2464 759 21821 81.6 3.1 16894 2043 855 990 574 18940 3034 393 GA Banks County 4625.75 3042.93 557.61 1582.82 822.59 47.81 122.89 289.06 5640.16 79.0 124819.289536 8.908702e+08 {'rings': [[[-9290699.0981, 4093591.7993], [-9...
4 5 2901 509 GA Rockdale County 1727 307 2.81 73754 49467 60571 9491 37.8 62154 4497 26350 692.1 6.1 31300 11944 5603 11105 5462 89832 2150 509 GA Rockdale County 5085.66 3356.61 520.50 1729.05 948.00 54.48 109.17 344.50 7113.68 1210.0 117646.898958 4.963801e+08 {'rings': [[[-9341355.0324, 3994505.8356000036...

We can see that the dataframe (allprovider_df) has 3139 rows and 41 columns.

Clean Data

Here we will:

  • Check for Null values
  • Remove any duplicate columns
  • Rename columns
In [10]:
# Check for null values
allprovider_df.isnull().sum()
Out[10]:
objectid1          0
objectid           0
objectid_left      0
st_abbrev_left     0
name_left          0
asian_cy           0
amerind_cy         0
avghhsz_cy         0
avghinc_cy         0
black_cy           0
educbasecy         0
hisppop_cy         0
medage_cy          0
minoritycy         0
othrace_cy         0
pci_cy             0
popdens_cy         0
unemprt_cy         0
white_cy           0
smcoll_cy          0
asscdeg_cy         0
bachdeg_cy         0
graddeg_cy         0
totpop_cy          0
index_right        0
objectid_right     0
st_abbrev_right    0
name_right         0
x8001_a            0
x8002_a            0
x8013_a            0
x8018_a            0
x8019_a            0
x8024_a            0
x8032_a            0
x13002_a           0
x13004_a           0
count              0
SHAPE__Length      0
SHAPE__Area        0
SHAPE              0
dtype: int64
In [11]:
# Check for duplicate columns
allprovider_df.columns
Out[11]:
Index(['objectid1', 'objectid', 'objectid_left', 'st_abbrev_left', 'name_left',
       'asian_cy', 'amerind_cy', 'avghhsz_cy', 'avghinc_cy', 'black_cy',
       'educbasecy', 'hisppop_cy', 'medage_cy', 'minoritycy', 'othrace_cy',
       'pci_cy', 'popdens_cy', 'unemprt_cy', 'white_cy', 'smcoll_cy',
       'asscdeg_cy', 'bachdeg_cy', 'graddeg_cy', 'totpop_cy', 'index_right',
       'objectid_right', 'st_abbrev_right', 'name_right', 'x8001_a', 'x8002_a',
       'x8013_a', 'x8018_a', 'x8019_a', 'x8024_a', 'x8032_a', 'x13002_a',
       'x13004_a', 'count', 'SHAPE__Length', 'SHAPE__Area', 'SHAPE'],
      dtype='object')

We can see that the OBJECTID, ST_ABBREV AND index columns have duplicates. We will drop these duplicate columns from our data.

In [12]:
# Drop duplicate columns
allprovider_df.drop(['objectid1','objectid','index_right','objectid_right','st_abbrev_right','name_right',
                    'SHAPE__Length', 'SHAPE__Area'], axis=1, inplace=True)
allprovider_df.columns
Out[12]:
Index(['objectid_left', 'st_abbrev_left', 'name_left', 'asian_cy',
       'amerind_cy', 'avghhsz_cy', 'avghinc_cy', 'black_cy', 'educbasecy',
       'hisppop_cy', 'medage_cy', 'minoritycy', 'othrace_cy', 'pci_cy',
       'popdens_cy', 'unemprt_cy', 'white_cy', 'smcoll_cy', 'asscdeg_cy',
       'bachdeg_cy', 'graddeg_cy', 'totpop_cy', 'x8001_a', 'x8002_a',
       'x8013_a', 'x8018_a', 'x8019_a', 'x8024_a', 'x8032_a', 'x13002_a',
       'x13004_a', 'count', 'SHAPE'],
      dtype='object')
In [13]:
# Rename Columns
allprovider_df.rename(columns={'x8001_a':'avg_healthcare','x8002_a':'avg_healthinsurance','x8013_a':'avg_medicare',
                                  'x8018_a':'avg_medicalcare','x8019_a':'avg_medicalsrvc','x8024_a':'avg_labtest',
                                  'x8032_a':'avg_presdrug','x13002_a':'avg_personalinsurance','x13004_a':'avg_socsecurity',
                                  'asian_cy':'asian_pop','amerind_cy':'amerind_pop','avghhsz_cy':'avg_hhsz',
                                  'avghinc_cy':'avg_hhinc','black_cy':'black_pop','educbasecy':'edubase',
                                  'hisppop_cy':'hisp_pop','medage_cy':'median_age','minoritycy':'minority_pop',
                                  'othrace_cy':'otherace_pop','pci_cy':'percap_income','popdens_cy':'pop_density',
                                  'unemprt_cy':'unemp_rate','white_cy':'white_pop','smcoll_cy':'some_college',
                                  'asscdeg_cy':'asso_deg','bachdeg_cy':'bach_deg','graddeg_cy':'grad_deg','totpop_cy':'total_population',
                                  'st_abbrev_left':'state','name_left':'county','count':'provider_count','objectid_left':'objectid'}, inplace=True)
allprovider_df.columns
Out[13]:
Index(['objectid', 'state', 'county', 'asian_pop', 'amerind_pop', 'avg_hhsz',
       'avg_hhinc', 'black_pop', 'edubase', 'hisp_pop', 'median_age',
       'minority_pop', 'otherace_pop', 'percap_income', 'pop_density',
       'unemp_rate', 'white_pop', 'some_college', 'asso_deg', 'bach_deg',
       'grad_deg', 'total_population', 'avg_healthcare', 'avg_healthinsurance',
       'avg_medicare', 'avg_medicalcare', 'avg_medicalsrvc', 'avg_labtest',
       'avg_presdrug', 'avg_personalinsurance', 'avg_socsecurity',
       'provider_count', 'SHAPE'],
      dtype='object')
In [14]:
allprovider_df.shape
Out[14]:
(3139, 33)

Plot Spatially Enabled Data

In [15]:
allprovider_df.spatial.plot()

This map shows all countiues with demographic and health expenditure information for each county.

In [247]:
# Write the data into a file geodatabase
allprovider_df.spatial.to_featureclass(r'C:\Users\mohi9282\Desktop\arcgis\demographic_healthexp.gdb\demographic_healthexp_clean_allproviders')
Out[247]:
'C:\\Users\\mohi9282\\Desktop\\arcgis\\demographic_healthexp.gdb\\demographic_healthexp_clean_allproviders'

Plot People to Provider Ratio - US

Get Cleaned Data

In [11]:
# Search for the data layer
allsearch_result = gis.content.search('title: demographic_healthexp_clean_allproviders')
allsearch_result
Out[11]:
[<Item title:"demographic_healthexp_clean_allproviders" type:Service Definition owner:portaladmin>,
 <Item title:"demographic_healthexp_clean_allproviders" type:Feature Layer Collection owner:portaladmin>]
In [12]:
# Get the layer
allprovider = allsearch_result[1]
allprovider
Out[12]:
demographic_healthexp_clean_allproviders
This layer contains provider count, cleaned demographic and health expenditure data.Feature Layer Collection by portaladmin
Last Modified: September 13, 2019
0 comments, 1 views
In [13]:
allprovider_layer = allprovider.layers[0]
for f in allprovider_layer.properties.fields[:5]:
    print(f['name'], '\t',f['alias'])
objectid 	 OBJECTID
state 	 state
county 	 county
asian_pop 	 asian_pop
amerind_pop 	 amerind_pop
In [14]:
# Store data from layer as a spatially enabled dataframe
allprovider_df = allprovider_layer.query(as_df=True)
allprovider_df.shape
Out[14]:
(3139, 35)
In [15]:
# Create a new column - People per Provider
allprovider_df['people_per_prov'] = allprovider_df['total_population']/allprovider_df['provider_count']

Plot People per Provider

In [78]:
allprovider_map = gis.map('USA', 4)
allprovider_map

From this map, we can see that:

  • Counties in Westen and North Eastern US seem to have the best people to provider ratio with 0-100 people per healthcare provider.
  • Areas in North Central US seem to have a mix of both low and high number of people per provider. Nebraska stands out with Thomas, Logan, Loup, Hayes and Wheeler counties where number of people per healthcare provider is > 500.
  • Counties in Southern states seem to suffer the most where a majority of counties have 100-200 people per provider or higher.
  • Texas seems to stand out where counties like Borden, Glasscock, Irion, San Jacinto, Newton have > 500 people per healthcare provider. We can also see more counties where number of people per provider is high.
Define Renderer
In [79]:
# Define Renderer
allProvTest = {"renderer": {
                 "type": "classBreaks",  
                 "field":"people_per_prov",
                 "transparency":.5,
                 "minValue":1}}
In [89]:
# Define Manual Class breaks
allProvTest['renderer']["classBreakInfos"] = [{
  "classMaxValue": 100.00,
  "label": "0 - 100.00",
  "description": "0 - 100.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [255,247,236,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 0.05
    }
  }
}, {
  "classMaxValue": 200.00,
  "label": "100.001 - 200.00",
  "description": "100.001 - 200.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [253,220,174,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 0.05
    }
  }
}, {
  "classMaxValue": 350.00,
  "label": "200.001 - 350.00",
  "description": "200.001 - 350.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [252,177,123,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 0.05
    }
  }
}, {
  "classMaxValue": 500.00,
  "label": "350.001 - 500.00",
  "description": "350.001 - 500.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [241,109,75,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 0.05
    }
  }
}, {
  "classMaxValue": 1100.00,
  "label": "500.001 - 1100.00",
  "description": "500.001 - 1100.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [200,28,18,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 0.05
    }
  }
}]
In [90]:
# Plot Map using defined Renderer
allprovider_map.remove_layers()
allprovider_df.spatial.plot(map_widget=allprovider_map, renderer=allProvTest['renderer'])
Out[90]:
True
In [82]:
allprovider_map.legend=True
In [92]:
allprovider_map.take_screenshot()

An alternate way to create the map without define a renderer is to directly plot the spatial dataframe by using Esri's classification algorithm. Here we show the use of esriClassifyNaturalBreaks which breaks the column values naturally into different classes.

In [68]:
# allprovider_df.spatial.plot(map_widget=allprovider_map,
#         renderer_type='c',  # for class breaks renderer
#         method='esriClassifyNaturalBreaks',  # classification algorithm
#         class_count=6,  # choose the number of classes
#         col='people_per_prov',  # numeric column to classify
#         cmap='OrRd',  # color map to pick colors from for each class
#         alpha=0.7 ,  # specify opacity
#         linewidth = 0.05)
Out[68]:
True

The Lone Star State - Exploring Texas

The second largest state in U.S. both by area and population, Texas baosts of 261,231.71 sq mi land area with a population of ~28.7 million. The population density is low with just 105.2 people per square mile and varies drastically between heavely populated urban areas to sparsely populated rural.

Image source: https://www.npr.org/sections/thetwo-way/2018/05/24/614195884/new-census-data-shows-texas-cities-are-growing-faster-than-all-other-states

As seen in the previous notebook, Texas has the fourth largest number of healthcare providers in U.S. However, the state stands second highest on Health Resources and Services Administration's (HRSA) list of shortage areas. Texas also tops HRSA's list of medically underserved areas.

Let's start our journey with exploring Texas!

Get the Data Layer

In [16]:
# Store data from layer as a spatially enabled dataframe
txprovider_df = allprovider_layer.query(where="state='TX'", out_fields=['objectid','state','county','avg_hhinc',
                                                                            'median_age','total_population','provider_count'],as_df=True)
txprovider_df.shape
Out[16]:
(253, 8)

Provider and Population Density

Let's explore distribution of providers with respect to population density.

In [17]:
txprovider_df['people_per_prov'] = txprovider_df['total_population']/txprovider_df['provider_count']
In [12]:
tx_pop_map = gis.map('Texas, USA', zoomlevel=4)
tx_pop_map

From this map, we can see that

  1. Irion County (in red) seems to be the worst with 799 people per healthcare provider.
  2. Borden, Glasscock, San Jacinto, Newton are some counties (dark rust) with 500-700 people per healthcare provider.
  3. There are very few counties (in white) in Texas with less than 100 people per healthcare provider.
Define Renderer
In [13]:
# Define Renderer
txpopTest = {"renderer": { #This tells python to use JS autocasting
                 "type": "classBreaks",  
                 "field":"people_per_prov",
                 "transparency":.5,
                 "minValue":1}}
In [14]:
# Define Manual Class breaks
txpopTest['renderer']["classBreakInfos"] = [{
  "classMaxValue": 100.00,
  "label": "0 - 100.00",
  "description": "0 - 100.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [255,247,236,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 2
    }
  }
}, {
  "classMaxValue": 300.00,
  "label": "100.001 - 300.00",
  "description": "100.001 - 300.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [253,220,174,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 2
    }
  }
}, {
  "classMaxValue": 500.00,
  "label": "300.001 - 500.00",
  "description": "300.001 - 500.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [252,177,123,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 2
    }
  }
}, {
  "classMaxValue": 700.00,
  "label": "500.001 - 700.00",
  "description": "500.001 - 700.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [241,109,75,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 2
    }
  }
}, {
  "classMaxValue": 800.00,
  "label": "700.001 - 800.00",
  "description": "700.001 - 800.00",
  "symbol": {
    "type": "esriSFS",
    "style": "esriSFSSolid",
    "color": [200,28,18,178.5],
    "outline": {
      "style": "esriSLSSolid",
      "type": "esriSLS",
      "color": [128,128,128,255],
      "width": 2
    }
  }
}]
Plot Map
In [15]:
# Plot Map using defined Renderer
tx_pop_map.remove_layers()
txprovider_df.spatial.plot(map_widget=tx_pop_map, renderer=txpopTest['renderer'])
Out[15]:
True
In [14]:
#### FOR AUTOMATIC RENDERING - manual class breaks are not used

# txprovider_df.spatial.plot(map_widget=tx_pop_map,
#         renderer_type='c',  # for class breaks renderer
#         method='esriClassifyNaturalBreaks',  # classification algorithm
#         class_count=8,  # choose the number of classes
#         col='people_per_prov',  # numeric column to classify
#         cmap='OrRd',  # color map to pick colors from for each class
#         alpha=0.7 ,  # specify opacity
#        )
Out[14]:
True
In [16]:
tx_pop_map.legend=True
In [17]:
tx_pop_map.take_screenshot()
In [32]:
tx_pop_map.remove_layers()
Out[32]:
True

Provider and Avg. Household Income

Let's create a density plot to explore providers with respect to avg. household income.

In [22]:
g = sns.jointplot(x='people_per_prov', y='avg_hhinc', data=txprovider_df,xlim=[0,400], ylim=[35000,100000] ,kind='kde', n_levels=8)
g.fig.suptitle('Kernel Density Plot of People per Provider vs Avg. Household Income', y=1.05)
Out[22]:
Text(0.5, 1.05, 'Kernel Density Plot of People per Provider vs Avg. Household Income')

From this plot we can see that for majority of the counties in Texas the average number of people per healthcare provider vary from 100-200 and the average household income for these counties vary from 60,000 - 70,000 dollars.

Provider and Median Age

Let's create a density plot to explore providers with respect to median age.

In [23]:
g = sns.jointplot(y='people_per_prov', x='median_age', data=txprovider_df, ylim=[0,400], xlim=[20,60] ,kind='kde')
g.fig.suptitle('Kernel Density Plot of People per Provider vs Median Age', y=1.05)
Out[23]:
Text(0.5, 1.05, 'Kernel Density Plot of People per Provider vs Median Age')

From this plot we can see that for majority of the counties in Texas the median age is between 35-45. The average number of people per healthcare provider vary from 60 - 210 for these counties.

Summary

To summarize, we saw that:

  • Counties in Westen and North Eastern US seem to have the best people to provider ratio with 0-100 people per healthcare provider.
  • Counties in Southern states seem to suffer the most where a majority of counties have 100-200 people per provider or higher.
  • Texas seems to stand out where:
    • Irion County (in red) seems to be the worst with 799 people per healthcare provider.
    • Borden, Glasscock, San Jacinto, Newton are some counties (dark rust) with 500-700 people per healthcare provider.
    • There are very few counties (in white) in Texas with less than 100 people per healthcare provider.